16S Assessment

I am largely following the DADA2 tutorial by Benjamin Callahan that can be found here.

Inspecting the data

Check samples to use

##  [1] "A10" "A11" "A12" "A2"  "A3"  "A6"  "A8"  "A9"  "B10" "B11" "B3"  "B4" 
## [13] "B5"  "B7"  "C1"  "C10" "C11" "C2"  "C4"  "G5"  "G6"



View example quality plots

Viewing the quality of 2 of the forward samples

Viewing the quality of the same 2 of the reverse samples



Filter and trim samples

Perform the filtering

You can see in above figures that the quality of reads drop off towards the end, so we need to filter out these low quality reads

##                  reads.in reads.out
## A10_16S_R1.fastq    25343     23991
## A11_16S_R1.fastq    27398     26006
## A12_16S_R1.fastq    45822     44152
## A2_16S_R1.fastq     99390     94836
## A3_16S_R1.fastq     77568     73174
## A6_16S_R1.fastq     44719     41897

After filtering out the low quality reads, we have maintained about 95.7 of the original reads.



Check the filtering

Viewing the quality of the same 2 of the forward samples post filtering and trimming

Viewing the quality of the same 2 of the reverse samples post filtering and trimming



DADA2 sample processing

Error Rates

The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. We want the estimated error rates (black line) to be a good fit to the observed rates (points), and the error rates to drop with increased quality.



Sample Inference

We are now ready to apply the core sample inference algorithm to the filtered and trimmed sequence data.


Forward sample inference

## Sample 1 - 23991 reads in 6094 unique sequences.
## Sample 2 - 26006 reads in 6312 unique sequences.
## Sample 3 - 44152 reads in 7218 unique sequences.
## Sample 4 - 94836 reads in 13509 unique sequences.
## Sample 5 - 73174 reads in 9644 unique sequences.
## Sample 6 - 41897 reads in 7509 unique sequences.
## Sample 7 - 48504 reads in 7342 unique sequences.
## Sample 8 - 32048 reads in 7630 unique sequences.
## Sample 9 - 24910 reads in 5725 unique sequences.
## Sample 10 - 34782 reads in 5737 unique sequences.
## Sample 11 - 34374 reads in 5211 unique sequences.
## Sample 12 - 90460 reads in 8319 unique sequences.
## Sample 13 - 23566 reads in 5815 unique sequences.
## Sample 14 - 110660 reads in 10438 unique sequences.
## Sample 15 - 27927 reads in 5388 unique sequences.
## Sample 16 - 73356 reads in 7647 unique sequences.
## Sample 17 - 23594 reads in 6498 unique sequences.
## Sample 18 - 25590 reads in 6768 unique sequences.
## Sample 19 - 42054 reads in 7901 unique sequences.
## Sample 20 - 1795 reads in 421 unique sequences.
## Sample 21 - 162 reads in 63 unique sequences.

Inspecting the returned dada-class object for the first forward sample:

## dada-class: object describing DADA2 denoising results
## 113 sequence variants were inferred from 6094 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

The DADA2 algorithm inferred 144 true sequence variants from the 9515 unique sequences in the first sample. There is much more to the dada-class return object than this (see help("dada-class") for some info), including multiple diagnostics about the quality of each denoised sequence variant, but that is beyond the scope of an introductory tutorial.


Reverse sample inference

## Sample 1 - 23991 reads in 5924 unique sequences.
## Sample 2 - 26006 reads in 6092 unique sequences.
## Sample 3 - 44152 reads in 7108 unique sequences.
## Sample 4 - 94836 reads in 12281 unique sequences.
## Sample 5 - 73174 reads in 8686 unique sequences.
## Sample 6 - 41897 reads in 7149 unique sequences.
## Sample 7 - 48504 reads in 7069 unique sequences.
## Sample 8 - 32048 reads in 7283 unique sequences.
## Sample 9 - 24910 reads in 5526 unique sequences.
## Sample 10 - 34782 reads in 5706 unique sequences.
## Sample 11 - 34374 reads in 4744 unique sequences.
## Sample 12 - 90460 reads in 8211 unique sequences.
## Sample 13 - 23566 reads in 5706 unique sequences.
## Sample 14 - 110660 reads in 10194 unique sequences.
## Sample 15 - 27927 reads in 5233 unique sequences.
## Sample 16 - 73356 reads in 7528 unique sequences.
## Sample 17 - 23594 reads in 6469 unique sequences.
## Sample 18 - 25590 reads in 6315 unique sequences.
## Sample 19 - 42054 reads in 7772 unique sequences.
## Sample 20 - 1795 reads in 465 unique sequences.
## Sample 21 - 162 reads in 69 unique sequences.



Merge paired reads

We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).

The mergers object is a list of data.frames from each sample. Each data.frame contains the merged sequence, its abundance, and the indices of the forward and reverse sequence variants that were merged. Paired reads that did not exactly overlap were removed by mergePairs, further reducing spurious output.



Construct sequence table

We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.


## 
##  223  227  239  246  251  252  253  254  255  256  257 
##    2    1    1    1    5   46 1601   86    4    4    2

The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants. This table contains 1753 ASVs.


After viewing the distribution of read lengths, it looks like we have some that fall outside the expected range (244 - 264) so we will go ahead and remove these non target length sequences.

## 
##  246  251  252  253  254  255  256  257 
##    1    5   46 1601   86    4    4    2

This updated table now contains 1749 ASVs.



Remove chimeras

The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.

A total of 9 bimeras were identified from the 1740 input sequences, thus retaining 100% of sequences.



Track reads through the pipeline

As a final check of our progress, we can look at the number of reads that made it through each step in the pipeline:

Genotype Treatment Input Filtered Denoised Forward Denoised Reverse Merged Nonchimera
6 OAW MP+ 25343 23991 23882 23969 23811 23811
5 OAW Control 27398 26006 25982 25971 25961 25961
5a AMB MP+ 45822 44152 44079 44127 43982 43982
10a OAW MP+ 99390 94836 94654 94676 94353 94205
5a AMB Control 77568 73174 73129 73130 73018 73018
10a AMB MP+ 44719 41897 41808 41848 41496 41496
6 AMB Control 50868 48504 48363 48414 48114 48073
5a OAW Control 33629 32048 31998 32015 31874 31842
5a OAW MP+ 26111 24910 24868 24838 24666 24641
5 AMB Control 36048 34782 34765 34754 34710 34710
2 AMB MP+ 36469 34374 34349 34353 34289 34289
2 AMB Control 93851 90460 90421 90323 90201 90201
6 AMB MP+ 24767 23566 23506 23540 23364 23328
10a OAW Control 114866 110660 110570 110599 110465 110452
5 AMB MP+ 29051 27927 27900 27896 27851 27851
2 OAW MP+ 75311 73356 73246 73255 72852 72852
6 OAW Control 24571 23594 23583 23578 23571 23571
10a AMB Control 26776 25590 25551 25561 25491 25491
5 OAW MP+ 43598 42054 42021 42021 41807 41723
NA blank_control 1918 1795 1792 1792 1792 1792
NA blank_control 169 162 153 158 153 153



Assign taxonomy

It is common at this point, especially in 16S/18S/ITS amplicon sequencing, to assign taxonomy to the sequence variants. The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose. The assignTaxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least minBoot bootstrap confidence.

The dada2 package GitHub maintains the most updated versions of the Silva databases., but I downloaded the databases from the associated Zenodo. The versions in this GitHub repository, used here, were last updated on 26 July 2022.

Let’s inspect the taxonomic assignments:

##      Kingdom    Phylum             Class                 Order              
## [1,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rickettsiales"    
## [2,] "Bacteria" "Cyanobacteria"    "Cyanobacteriia"      "Chloroplast"      
## [3,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Pseudomonadales"  
## [4,] "Bacteria" "Myxococcota"      "Myxococcia"          "Myxococcales"     
## [5,] "Bacteria" "Campylobacterota" "Campylobacteria"     "Campylobacterales"
## [6,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales"  
##      Family             Genus         Species
## [1,] "Fokiniaceae"      "MD3-55"      NA     
## [2,] NA                 NA            NA     
## [3,] "Pseudomonadaceae" "Pseudomonas" NA     
## [4,] "Myxococcaceae"    "P3OB-42"     NA     
## [5,] NA                 NA            NA     
## [6,] "Alcaligenaceae"   "Alcaligenes" NA

Great! We can now save this and hand it off to Phyloseq for further analyses.



Remove contaminations

Remove mitochondria, chloroplasts, and non-bacteria

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1610 taxa and 21 samples ]
## sample_data() Sample Data:       [ 21 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 1610 taxa by 7 taxonomic ranks ]

Removal of mitochondira, chloroplasts, and non-bacteria taxa reduced the total number of taxa from 1740 to 1610.



Remove neg control contamination

Just 3 out of the 1610 ASVs were classified as contaminants.


## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1607 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 1607 taxa by 7 taxonomic ranks ]



Blast NCBI

I have not rerun this portion since only running the pipeline with the 19 GE samples!

Looks good! We have now cleaned up the sample data to remove contamination from non target organisms and those from the negative controls.



Rarefy

Rarefy cleaned data

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1538 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 1538 taxa by 8 taxonomic ranks ]
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1514 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 1514 taxa by 8 taxonomic ranks ]



Trim underrepresented OUTs

## [1] "samples with counts below z-score -2.5 :"
## character(0)
## [1] "zscores:"
## named numeric(0)
## [1] "OTUs passing frequency cutoff  1e-04 : 416"
## [1] "OTUs with counts in 0.02 of samples:"
## 
## TRUE 
##  416
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 416 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 416 taxa by 8 taxonomic ranks ]



Rarefy trimmed data

## named numeric(0)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 416 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 416 taxa by 8 taxonomic ranks ]



Microbiome diversity analyses

Alpha metrics

Notes from phyloseq author Visualize alpha-diversity - Should be done on raw, untrimmed dataset


Simpson diversity


Shannon diversity



Core vs. accessory microbiome

Core

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 10 taxa by 8 taxonomic ranks ]

Core stats

Accessory

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 406 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 406 taxa by 8 taxonomic ranks ]

Accessory stats

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE
## Analysis of Variance Table
## 
## Response: Distances
##           Df    Sum Sq    Mean Sq F value Pr(>F)
## Groups     3 0.0002995 0.00009984   0.117 0.9487
## Residuals 15 0.0127979 0.00085319
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df    Sum Sq    Mean Sq     F N.Perm Pr(>F)
## Groups     3 0.0002995 0.00009984 0.117    999  0.933
## Residuals 15 0.0127979 0.00085319                    
## 
## Pairwise comparisons:
## (Observed p-value below diagonal, permuted p-value above diagonal)
##             AMB Control AMB MP+ OAW Control OAW MP+
## AMB Control             0.60700     0.52400   0.791
## AMB MP+         0.59848             0.94400   0.762
## OAW Control     0.54900 0.93876               0.827
## OAW MP+         0.80260 0.79523     0.83516

## 
## Call:
## adonis(formula = seq.acc ~ Treatment, data = samdf.acc, permutations = 999) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)
## Treatment  3    1.0761 0.35869 0.79026 0.13648  0.993
## Residuals 15    6.8083 0.45388         0.86352       
## Total     18    7.8843                 1.00000



Data visualization


Preliminary figures

Taking phyloseq data and making some preliminary visualizations based on DADA2 tutorial:



Principal component analyses

All microbiome


Core microbiome


Accessory microbiome



Shannon diversity



Simpson diversity



ITS2 Assessment

Abundance by treatment and genotype

Figure SXX. Relative abundance of major ITS2 types by coral fragment (x axis) and treatment (facets). Light green represent Symbiodinium spp. (A3) and dark green represents Breviolum spp. (B2).



Abundance by treatment only

Figure SXX. Relative abundance of major ITS2 types grouped by treatment (n = 4-5 corals per treatment). Light green represent Symbiodinium spp. (A3) and dark green represents Breviolum spp. (B2).



Manuscript Materials

Methods

Bacterial and Symbiodiniaceae metabarcoding

We characterized the coral microbial and Symbiodiniaceae community composition using 16S and ITS2 metabarcoding, respectively, to target the V4/V5 region following (43). Pooled libraries were sequenced on an Illumina Miseq (paired-end 250 bp) at Tufts Sequencing facility. P

16S and ITS2 data were pre-processed in bbmap (36) and cutadapt (37) to remove primer sequences. DADA2 (38) truncated reads, calculated error rates, de-duplicated reads, inferred sequence variants, merged paired reads, and removed bimeras (38) (see table SXX for read counts through processing).

16S taxonomy was assigned using the Silva v132 dataset (44) and ITS2 taxonomy was assigned through submission to Symportal (CITE). Using phyloseq (41), ASVs assigning to family “Mitochondria”, order “Chloroplast”, or those failing to assign to kingdom “Bacteria” were removed. Using vegan (29), the ASV table was rarefied to 12,000 reads, which retained ~90% of samples (9 samples removed; Table S2). Phyloseq (41) calculated three diversity metrics: ASV richness, Shannon index, and inverse of Simpson’s index. ASV richness and inverse Simpson index were log-transformed and a one-way ANOVA with Tukey HSD tests compared diversity metrics across sites and reef zones. MCMC.OTU (42) then trimmed ASVs representing <0.01% of counts or only present in one sample. Linear regressions assessed correlations between alpha diversity metrics and colony size. Sample dissimilarity was evaluated for the total bacterial community as described for ITS2. These analyses were then repeated on 16S data separated by microbiome (45) into core (present in >70% of samples) and accessory components, and finally repeated using unrarefied, relative ASV abundances to confirm results.

metabarcoding

Methods text goes here



Results

Results text goes here



Figures



Supplemental figures



Supplemental tables



Session Information

All code was written by Colleen B. Bove, feel free to contact with questions.

Session information from the last run date on 01 August 2022:

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] awtools_0.2.1               microbiome_1.8.0           
##  [3] ggpubr_0.4.0                MCMC.OTU_1.0.10            
##  [5] MCMCglmm_2.33               ape_5.6-1                  
##  [7] coda_0.19-4                 Matrix_1.3-4               
##  [9] janitor_2.1.0               vegan_2.5-7                
## [11] lattice_0.20-45             permute_0.9-7              
## [13] plotly_4.10.0               RColorBrewer_1.1-3         
## [15] decontam_1.6.0              phyloseq_1.30.0            
## [17] kableExtra_1.3.4            ShortRead_1.44.3           
## [19] GenomicAlignments_1.22.1    SummarizedExperiment_1.16.1
## [21] DelayedArray_0.12.3         matrixStats_0.62.0         
## [23] Biobase_2.46.0              Rsamtools_2.2.3            
## [25] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [27] Biostrings_2.54.0           XVector_0.26.0             
## [29] IRanges_2.20.2              S4Vectors_0.24.4           
## [31] BiocParallel_1.20.1         BiocGenerics_0.32.0        
## [33] forcats_0.5.1               stringr_1.4.0              
## [35] dplyr_1.0.8                 purrr_0.3.4                
## [37] readr_2.1.2                 tidyr_1.2.0                
## [39] tibble_3.1.7                ggplot2_3.3.6              
## [41] tidyverse_1.3.1             dada2_1.20.0               
## [43] Rcpp_1.0.9                  knitr_1.33                 
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.4.1        systemfonts_1.0.2     
##   [4] plyr_1.8.7             igraph_1.2.11          lazyeval_0.2.2        
##   [7] splines_3.6.3          crosstalk_1.2.0        digest_0.6.29         
##  [10] foreach_1.5.2          htmltools_0.5.2        fansi_1.0.3           
##  [13] magrittr_2.0.3         cluster_2.1.2          tzdb_0.2.0            
##  [16] modelr_0.1.8           RcppParallel_5.1.5     svglite_2.1.0         
##  [19] jpeg_0.1-9             colorspace_2.0-3       rvest_1.0.1           
##  [22] textshaping_0.3.6      haven_2.4.3            xfun_0.29             
##  [25] crayon_1.5.1           RCurl_1.98-1.7         jsonlite_1.7.2        
##  [28] survival_3.2-13        iterators_1.0.14       glue_1.6.2            
##  [31] gtable_0.3.0           zlibbioc_1.32.0        webshot_0.5.2         
##  [34] car_3.1-0              Rhdf5lib_1.8.0         abind_1.4-5           
##  [37] scales_1.2.0           DBI_1.1.3              rstatix_0.7.0         
##  [40] viridisLite_0.4.0      htmlwidgets_1.5.4      httr_1.4.3            
##  [43] ellipsis_0.3.2         farver_2.1.1           pkgconfig_2.0.3       
##  [46] sass_0.4.0             dbplyr_2.1.1           deldir_1.0-6          
##  [49] utf8_1.2.2             labeling_0.4.2         tidyselect_1.1.1      
##  [52] rlang_1.0.4            reshape2_1.4.4         munsell_0.5.0         
##  [55] cellranger_1.1.0       tools_3.6.3            cachem_1.0.6          
##  [58] cli_3.3.0              generics_0.1.3         ade4_1.7-18           
##  [61] broom_1.0.0            evaluate_0.15          biomformat_1.14.0     
##  [64] fastmap_1.1.0          ragg_1.1.3             yaml_2.3.5            
##  [67] fs_1.5.2               nlme_3.1-155           xml2_1.3.3            
##  [70] compiler_3.6.3         rstudioapi_0.13        png_0.1-7             
##  [73] ggsignif_0.6.3         reprex_2.0.1           bslib_0.4.0           
##  [76] stringi_1.7.8          highr_0.9              cubature_2.0.4.2      
##  [79] tensorA_0.36.2         multtest_2.42.0        vctrs_0.4.1           
##  [82] pillar_1.8.0           lifecycle_1.0.1        jquerylib_0.1.4       
##  [85] cowplot_1.1.1          data.table_1.14.2      bitops_1.0-7          
##  [88] corpcor_1.6.10         R6_2.5.1               latticeExtra_0.6-30   
##  [91] hwriter_1.3.2.1        gridExtra_2.3          codetools_0.2-18      
##  [94] MASS_7.3-55            assertthat_0.2.1       rhdf5_2.30.1          
##  [97] withr_2.5.0            GenomeInfoDbData_1.2.2 mgcv_1.8-38           
## [100] hms_1.1.1              grid_3.6.3             rmarkdown_2.13        
## [103] snakecase_0.11.0       carData_3.0-5          Rtsne_0.15            
## [106] lubridate_1.8.0        interp_1.0-33